Abstract

Introduction. Environmental problems that arise in shallow waters from both natural and man-made factors do
significant damage to aquatic systems and coastal territories every year. Modern computing systems make it possible to
identify these problems, and ways to eliminate them, in a timely manner. However, earlier studies have shown that the
resources of computing systems that use only a central processor are insufficient for large scientific problems, in
particular for predicting major environmental accidents, assessing the damage they cause, and determining how they can
be eliminated. For these purposes, it is proposed to use models of the computing system and of the decomposition of the
computational domain to develop an algorithm for parallel-pipeline calculations. The research objective was to create a
model of a parallel-pipeline computational process for solving a system of grid equations by a modified
alternating-triangular iterative method, using a decomposition of a three-dimensional uniform computational grid that
takes into account the technical characteristics of the equipment used for the calculations.

Materials and Methods. Mathematical models of the computer system and computational grid were developed. The 
decomposition model of the computational domain was made taking into account the characteristics of a heterogeneous 
system. A parallel-pipeline method for solving a system of grid equations by a modified alternating-triangular iterative 
method was proposed. 

Results. A program implementing the parallel-pipeline method for solving a system of grid equations by a modified
alternating-triangular iterative method was written in CUDA C. The experiments showed that the computation time
decreased as the number of threads increased, and that, when decomposing the computational grid, it was rational to
split it into no more than 10 fragments along the z coordinate. The experimental results confirmed the efficiency of
the developed parallel-pipeline method.

Discussion and Conclusion. As a result of the research, a model of a parallel-pipeline computing process was
developed using the example of one of the most time-consuming stages of solving a system of grid equations by a
modified alternating-triangular iterative method. The model is built on decomposition models of a three-dimensional
uniform computational grid that take into account the technical characteristics of the equipment used in the
calculations. Using the program makes it possible to accelerate the calculation and to load the program threads evenly
in time. The numerical experiments validated the mathematical model of decomposition of the computational domain.

Keywords: parallel algorithm, computational process, grid equations

Acknowledgements: the authors thank the editorial board of the journal and the reviewer for their professional
analysis and their recommendations for revising the article.

Funding. The work was supported by the Russian Science Foundation (project No. 21-71-20050).
Introduction. Recently, a number of serious environmental problems have been observed in the Rostov region.
These include, in particular, the eutrophication of the waters of the Sea of Azov and the Tsimlyansk reservoir, which
causes the growth of harmful and toxic species of phytoplankton [1]. Engineering works in the waters of rivers and
seas pollute adjacent territories, change the population structure of biota, and worsen the reproduction conditions of
valuable and commercial fish. Climate change in the south of Russia has led to more frequent flooding of some
territories around the Taganrog Bay and the floodplain of the Don River caused by surge and downsurge events. In the
last decade, the Don riverbed has almost completely drained several times during the summer, which brought navigation
to a complete stop. To predict the occurrence and development of such cases, to plan ways to address their
consequences, and to assess the damage they cause, modern software systems built on high-precision mathematical
models, numerical methods, algorithms and data structures are needed [2].

Mathematical models used in predicting natural and man-made disasters are based on systems of partial differential
equations, such as the Poisson, Navier-Stokes, diffusion-convection-reaction, and heat conduction equations. Solving
such systems numerically requires keeping large amounts of data (in arrays of various structures) in main memory and
solving systems of grid equations of high dimension, exceeding $10^9$. The amount of RAM required to store the data
arrays when numerically solving just one Poisson equation for a three-dimensional domain of
$10^3 \times 10^3 \times 10^3$ nodes by the alternating-triangular iterative method is more than 64 GB. Coupled
problems require hundreds of gigabytes of RAM, which is available only on supercomputer systems.

An earlier study showed that the resources of a computing system using only the CPU are insufficient for such
scientific problems [3]. Growth in GPU power and video memory has made it possible to use video adapter resources for
calculations [4]. Effective GPU utilization depends on applying parallel algorithms to the computationally intensive
problems of aquatic ecology [5-7]. The shortage of memory and computing power on workstations can be partially
addressed by installing additional video adapters, either directly in PCI-E X16 slots or in PCI-E X1 slots through
PCI-E X1 to PCI-E X16 adapters. In this way, the number of video adapters installed on one workstation can be
increased to 12 [8-11].

Heterogeneous computer systems that share CPU and GPU resources are becoming increasingly popular in the scientific
community. Such systems make it possible to reduce the computation time of scientific problems [12-14]. However, using
a heterogeneous computing environment requires modernizing the mathematical models, the algorithms, and the programs
that implement them numerically. A heterogeneous system allows the calculation process to be organized in parallel
mode. At the same time, the fundamental differences in how software is built for the CPU and for the GPU must be taken
into account when the two are used together.

Materials and Methods. We describe the proposed mathematical models of the computational system, the 
computational grid, as well as the method of decomposition of the computational domain. 

Let $D$ be the set of technical characteristics of a computing system. Then

$$D = D^1 \cup D^2 \cup D^3, \tag{1}$$

where $D^1$ is the subset of characteristics of the central processing units (CPU) of the computing system; $D^2$ is
the subset of characteristics of its video adapters (GPU); $D^3$ is the subset of RAM characteristics.

$$D^1 = (d^{1,1}, d^{1,2}, d^{1,3}, d^{1,4}), \tag{2}$$

where $d^{1,1}$ is the total number of CPU cores; $d^{1,2}$ is the number of threads processed simultaneously by one
CPU core; $d^{1,3}$ is the clock rate, MHz; $d^{1,4}$ is the CPU bus frequency, MHz.

$$D^2 = \bigcup_{k_{GPU} \in K_{GPU}} D^2_{k_{GPU}} = \{ d^2 \mid \exists\, k_{GPU} \in K_{GPU}:\ d^2 \in D^2_{k_{GPU}} \}, \tag{3}$$

where $K_{GPU} = \{1, \dots, N_{GPU}\}$ is the set of video adapter indices; $N_{GPU}$ is the number of video adapters
in the computing system; $k_{GPU}$ is a video adapter index. Each video adapter is represented as a tuple

$$D^2_{k_{GPU}} = (d^{2,1}_{k_{GPU}}, d^{2,2}_{k_{GPU}}), \tag{4}$$

where $d^{2,1}_{k_{GPU}}$ is the amount of video memory of the video adapter with index $k_{GPU}$, GB;
$d^{2,2}_{k_{GPU}}$ is its number of streaming multiprocessors.

$$D^3 = (d^{3,1}, d^{3,2}), \tag{5}$$

where $d^{3,1}$ is the total amount of RAM, GB; $d^{3,2}$ is the clock rate of RAM, MHz.

Let $S$ be the set of program threads involved in the computing process. Then

$$S = S^1 \cup S^2, \qquad S^1 = \{1, \dots, N_{S^1}\}, \qquad S^2 = \bigcup_{k_{GPU} \in K_{GPU}} S^2_{k_{GPU}}, \qquad S^2_{k_{GPU}} = \{1, \dots, N_{S^2_{k_{GPU}}}\}, \tag{6}$$

where $S^1$ is the subset of program threads implementing the calculation process on the CPU; $S^2$ is the subset of
CUDA thread blocks implementing the calculation process on GPU streaming multiprocessors; $N_{S^1}$ is the number of
CPU program threads involved; $S^2_{k_{GPU}}$ is the subset of CUDA thread blocks that implement the calculation
process on the streaming multiprocessors of the GPU with index $k_{GPU}$; $K_{GPU} = \{1, \dots, N_{GPU}\}$ is the set
of GPU indices; $N_{GPU}$ is the number of GPUs in the computing system; $N_{S^2_{k_{GPU}}}$ is the number of CUDA
thread blocks involved that implement the calculation process on the streaming multiprocessors of the GPU with index
$k_{GPU}$.


Let $E$ be the set of identifiers of program threads. To identify the program threads in the computing system, we
assign a tuple $e$ of two elements to each element of the set of program threads $S$:

$$\forall s \in S\ \ \exists e \in E:\ e = (n_1, n_2), \tag{7}$$

where $n_1$ is the index of a computing device in the computing system; $n_2$ is the index of the CPU program thread or
GPU thread block:

$$n_1 = \begin{cases} 0, & s \in S^1, \\ k_{GPU}, & s \in S^2, \end{cases} \tag{8}$$

$$n_2 = \begin{cases} k_{S^1}, & s \in S^1, \\ k_{S^2_{k_{GPU}}}, & s \in S^2_{k_{GPU}}. \end{cases} \tag{9}$$


Consider a computational domain with the following parameters: $l_x$ is the characteristic size along the $Ox$ axis,
$l_y$ along the $Oy$ axis, and $l_z$ along the $Oz$ axis. We associate with this domain a uniform computational grid of
the form

$$W = \{ x_i = i h_x,\ y_j = j h_y,\ z_k = k h_z;\ \ i = 0, \dots, n_x - 1,\ j = 0, \dots, n_y - 1,\ k = 0, \dots, n_z - 1;\ \ (n_x - 1) h_x = l_x,\ (n_y - 1) h_y = l_y,\ (n_z - 1) h_z = l_z \}, \tag{10}$$

where $h_x$, $h_y$, $h_z$ are the steps of the computational grid in the corresponding spatial directions; $n_x$,
$n_y$, $n_z$ are the numbers of nodes of the computational grid in the corresponding spatial directions.

We then represent the set of nodes of the computational grid in the form

$$G = \{ g_{i,j,k},\ i = 0, \dots, n_x - 1,\ j = 0, \dots, n_y - 1,\ k = 0, \dots, n_z - 1 \}, \qquad g_{i,j,k} = (x_i, y_j, z_k), \tag{11}$$

where $g_{i,j,k}$ is a computational grid node.

The number of nodes of the computational grid $N_G$ is calculated from the formula

$$N_G = n_x \cdot n_y \cdot n_z. \tag{12}$$
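Definitions (10)-(12) can be illustrated with a minimal Python sketch. The class name `UniformGrid`, its method names, and the domain extents in the example are illustrative assumptions, not part of the described program; only the node counts 1,600 x 1,600 x 200 are taken from the experiments reported below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UniformGrid:
    """Uniform grid W from (10): n nodes and extent l along each axis."""
    nx: int
    ny: int
    nz: int
    lx: float
    ly: float
    lz: float

    def steps(self):
        # From (10): (n - 1) * h = l, hence h = l / (n - 1).
        return (self.lx / (self.nx - 1),
                self.ly / (self.ny - 1),
                self.lz / (self.nz - 1))

    def node(self, i, j, k):
        # Node g_{i,j,k} = (x_i, y_j, z_k) from (11).
        hx, hy, hz = self.steps()
        return (i * hx, j * hy, k * hz)

    def node_count(self):
        # N_G from (12).
        return self.nx * self.ny * self.nz


# Hypothetical extents chosen so that every grid step equals 1.
grid = UniformGrid(nx=1600, ny=1600, nz=200, lx=1599.0, ly=1599.0, lz=199.0)
print(grid.steps(), grid.node_count())
```

For a grid of this size, $N_G$ is 512 million nodes, which shows why the grid has to be decomposed before it is distributed over devices.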


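By a subsection of the computational grid $G^{k_1} \subset G$ (hereinafter, subsection) we mean a subset of the nodes
of the computational grid $G$.

$$G = \bigcup_{k_1 \in K_{k_1}} G^{k_1} = \{ g^{k_1} \mid \exists\, k_1 \in K_{k_1}:\ g^{k_1} \in G^{k_1} \}, \qquad \bigcap_{k_1 \in K_{k_1}} G^{k_1} = \varnothing, \tag{13}$$

where $K_{k_1} = \{1, \dots, N_{k_1}\}$ is the set of indices of the subsections $G^{k_1}$ of computational grid $G$;
$N_{k_1}$ is the number of subsections $G^{k_1}$; $K_{k_1} \subset \mathbb{N}$, $N_{k_1} \in \mathbb{N}$, where
$\mathbb{N}$ is the set of natural numbers; $k_1$ is the index of subsection $G^{k_1}$.

Since $G^{k_1} \subset G$, then

$$G^{k_1} = \{ g^{k_1}_{i,\tilde{j},k},\ i = 0, \dots, n_x - 1,\ \tilde{j} = 0, \dots, \tilde{n}_y^{k_1} - 1,\ k = 0, \dots, n_z - 1 \}, \tag{14}$$

where $g^{k_1}_{i,\tilde{j},k}$ is a node of subsection $k_1$; the tilde indicates belonging to the subsection;
$\tilde{j}$ is the node index of subsection $k_1$ along the $y$ coordinate; $\tilde{n}_y^{k_1}$ is the number of nodes
in subsection $k_1$ along the $y$ coordinate.

$$g^{k_1}_{i,\tilde{j},k} = (x_i, y_{\tilde{j}}, z_k), \qquad x_i = i h_x, \quad y_{\tilde{j}} = \Big( \sum_{b=1}^{k_1 - 1} \tilde{n}_y^{b} + \tilde{j} \Big) h_y, \quad z_k = k h_z, \tag{15}$$

where $\tilde{n}_y^{b}$ is the number of nodes along the $y$ coordinate of the $b$-th subsection.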
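The index arithmetic in (15) can be sketched as follows. The function name `subsection_y` and the subsection sizes in the example are hypothetical and serve only to show how a local index inside a subsection maps to a global $y$ coordinate.

```python
def subsection_y(k1, j_local, sub_ny, hy):
    """y coordinate of local node j_local in subsection k1 (1-based), as in (15).

    sub_ny[b - 1] holds the number of nodes along y in subsection b.
    """
    # Nodes along y in all preceding subsections 1 .. k1 - 1.
    offset = sum(sub_ny[:k1 - 1])
    return (offset + j_local) * hy


# Hypothetical split of 1,600 nodes along y into three subsections.
sub_ny = [600, 600, 400]
print(subsection_y(3, 5, sub_ny, hy=0.5))
```

The first subsection has a zero offset, so its local and global indices coincide.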


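By a block of the computational grid $G^{k_1,k_2}$ (hereinafter, block) we mean a subset of the computational grid
nodes of subsection $G^{k_1}$.

$$G^{k_1} = \bigcup_{k_2 \in K_{k_1,k_2}} G^{k_1,k_2} = \{ g^{k_1,k_2} \mid \exists\, k_2 \in K_{k_1,k_2}:\ g^{k_1,k_2} \in G^{k_1,k_2} \}, \qquad \bigcap_{k_2 \in K_{k_1,k_2}} G^{k_1,k_2} = \varnothing, \tag{16}$$

where $K_{k_1,k_2} = \{1, \dots, N_{k_1,k_2}\}$ is the set of indices of the blocks $G^{k_1,k_2}$ of subsection
$G^{k_1}$; $N_{k_1,k_2}$ is the number of blocks $G^{k_1,k_2}$; $K_{k_1,k_2} \subset \mathbb{N}$,
$N_{k_1,k_2} \in \mathbb{N}$; $k_2$ is the index of block $G^{k_1,k_2}$ of subsection $G^{k_1}$.

Since $G^{k_1,k_2} \subset G^{k_1}$, then

$$G^{k_1,k_2} = \{ g^{k_1,k_2}_{i,\hat{j},k},\ i = 0, \dots, n_x - 1,\ \hat{j} = 0, \dots, \hat{n}_y^{k_1,k_2} - 1,\ k = 0, \dots, n_z - 1 \}, \tag{17}$$

where $g^{k_1,k_2}_{i,\hat{j},k}$ is a node of block $k_1, k_2$; the hat indicates belonging to the block; $\hat{j}$ is
the node index of block $k_1, k_2$ along the $y$ coordinate; $\hat{n}_y^{k_1,k_2}$ is the number of nodes in block
$k_1, k_2$ along the $y$ coordinate.

$$g^{k_1,k_2}_{i,\hat{j},k} = (x_i, y_{\hat{j}}, z_k), \qquad x_i = i h_x, \quad y_{\hat{j}} = \Big( \sum_{b_1=1}^{k_1 - 1} \sum_{b_2=1}^{N_{b_1,k_2}} \hat{n}_y^{b_1,b_2} + \hat{j} \Big) h_y, \quad z_k = k h_z, \tag{18}$$

where $\hat{n}_y^{b_1,b_2}$ is the number of nodes of block $b_1, b_2$.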


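By a fragment of the computational grid $G^{k_1,k_2,k_3}$ (hereinafter, fragment) we mean a subset of the nodes of
block $G^{k_1,k_2}$ of subsection $G^{k_1}$.

$$G^{k_1,k_2} = \bigcup_{k_3 \in K_{k_1,k_2,k_3}} G^{k_1,k_2,k_3} = \{ g^{k_1,k_2,k_3} \mid \exists\, k_3 \in K_{k_1,k_2,k_3}:\ g^{k_1,k_2,k_3} \in G^{k_1,k_2,k_3} \}, \qquad \bigcap_{k_3 \in K_{k_1,k_2,k_3}} G^{k_1,k_2,k_3} = \varnothing, \tag{19}$$

where $K_{k_1,k_2,k_3} = \{1, \dots, N_{k_1,k_2,k_3}\}$ is the set of indices of the fragments $G^{k_1,k_2,k_3}$ of block
$G^{k_1,k_2}$ of subsection $G^{k_1}$; $N_{k_1,k_2,k_3}$ is the number of fragments $G^{k_1,k_2,k_3}$;
$K_{k_1,k_2,k_3} \subset \mathbb{N}$, $N_{k_1,k_2,k_3} \in \mathbb{N}$; $k_3$ is the index of fragment
$G^{k_1,k_2,k_3}$ of block $G^{k_1,k_2}$ of subsection $G^{k_1}$.

Each index $k_3$ of fragment $G^{k_1,k_2,k_3}$ is assigned a tuple of indices $(k_4, k_5)$ that stores the fragment
coordinates in the plane $xOz$:

$$k_3 = k_4 + N_{k_4} \cdot k_5, \tag{20}$$

where $k_4$ is the fragment index along the $x$ coordinate; $k_5$ is the fragment index along the $z$ coordinate.

The number of fragments $G^{k_1,k_2,k_3}$ of block $G^{k_1,k_2}$ is calculated from the formula

$$N_{k_1,k_2,k_3} = N_{k_4} \cdot N_{k_5}, \tag{21}$$

where $N_{k_4}$ is the number of fragments along the $Ox$ axis; $N_{k_5}$ is the number of fragments along the $z$
coordinate.

Since $G^{k_1,k_2,k_3} \subset G^{k_1,k_2}$, then

$$G^{k_1,k_2,k_3} = \{ g^{k_1,k_2,k_3}_{\bar{i},\hat{j},\bar{k}},\ \bar{i} = 0, \dots, \bar{n}_x - 1,\ \hat{j} = 0, \dots, \hat{n}_y - 1,\ \bar{k} = 0, \dots, \bar{n}_z - 1 \}, \tag{22}$$

where $g^{k_1,k_2,k_3}_{\bar{i},\hat{j},\bar{k}}$ is a fragment node; the bar indicates belonging to the fragment;
$\bar{i}$, $\bar{k}$ are the fragment node indices along the $x$ and $z$ coordinates; $\bar{n}_x$, $\bar{n}_z$ are the
numbers of nodes of the computational grid in the fragment along $x$ and $z$; $\bar{l}_x$, $\bar{l}_z$ are the
fragment dimensions along $x$ and $z$.

$$g^{k_1,k_2,k_3}_{\bar{i},\hat{j},\bar{k}} = (x_{\bar{i}}, y_{\hat{j}}, z_{\bar{k}}), \qquad x_{\bar{i}} = \Big( \sum_{b=1}^{k_4 - 1} \bar{n}_x^{b} + \bar{i} \Big) h_x, \quad y_{\hat{j}} = \hat{j} h_y, \quad z_{\bar{k}} = \Big( \sum_{b=1}^{k_5 - 1} \bar{n}_z^{b} + \bar{k} \Big) h_z, \tag{23}$$

where $\bar{n}_x^{b}$, $\bar{n}_z^{b}$ are the numbers of nodes of the $b$-th fragment.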
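We introduce the set $M$ of mappings of the computational grid blocks to program threads:

$$M = \bigcup_{k_1 \in K_{k_1}} \ \bigcup_{k_2 \in K_{k_1,k_2}} M_{k_1,k_2}, \tag{24}$$

where $M_{k_1,k_2}$ is an element of the set $M$.

Let $M_{k_1,k_2}$ be the mapping of block $G^{k_1,k_2}$ to program thread $s_{k_1,k_2}$. Then

$$M_{k_1,k_2} = (G^{k_1,k_2}, s_{k_1,k_2}), \tag{25}$$

where $s_{k_1,k_2} \in S$ is the program thread that computes block $G^{k_1,k_2}$.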


In the process of solving hydrodynamic problems on three-dimensional computational grids of large dimension, 
high-performance computing systems and huge amounts of memory for data storage are needed. The resources of one 
computing device are not enough for computing and storing a three-dimensional computational grid with all its data. To 
solve this problem, various methods of decomposition of computational grids followed by the use of parallel calculation 
algorithms in heterogeneous computing environments are proposed [15]. 

For the decomposition of the computational grid, it is required to take into account the performance of computing 
devices involved in calculations. By performance, we mean the number of nodes of the computational grid calculated 
using a given algorithm per unit of time. 


Assume that all computing devices are used for the calculations. Then the total performance of the computing system
$P_\Sigma$ is calculated from the formula

$$P_\Sigma = P_{S^1} N_{S^1} + \sum_{b=1}^{N_{GPU}} P_{S^2_b} N_{S^2_b}, \tag{26}$$

where $P_{S^1}$ is the performance of a single CPU thread; $N_{S^1}$ is the number of program threads implementing the
calculation process on the CPU; $P_{S^2_b}$ is the performance of a single streaming multiprocessor of the GPU with
index $b$; $N_{S^2_b}$ is the number of CUDA thread blocks implementing the calculation process on the streaming
multiprocessors of that GPU.

Then the number of nodes of the computational grid $n_y^b$ in the subsection along the $y$ coordinate for each GPU
with index $b$ can be calculated from the formula

$$n_y^b = \left\lfloor \frac{P_{S^2_b} N_{S^2_b}}{P_\Sigma}\, n_y \right\rfloor. \tag{27}$$

Formula (27) leaves a remainder, a certain number of nodes of the computational grid. These nodes are located in RAM.
The number of remaining nodes $n_y^{CPU}$ along the $y$ coordinate is calculated from the formula

$$n_y^{CPU} = n_y - \sum_{b=1}^{N_{GPU}} n_y^b. \tag{28}$$
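The performance-weighted split along $y$ in (26)-(28) can be sketched in Python as below. This assumes that (27) assigns each GPU a share of nodes proportional to its performance, rounded down; the function name and the performance figures in the example are hypothetical, not measured values.

```python
import math


def split_y_between_devices(n_y, p_cpu, n_cpu_threads, gpus):
    """Distribute n_y nodes along y between the GPUs and the CPU.

    gpus is a list of (performance per streaming multiprocessor,
    number of CUDA thread blocks) pairs, one pair per GPU.
    """
    # (26): total performance of the computing system.
    p_total = p_cpu * n_cpu_threads + sum(p * n for p, n in gpus)
    # (27): share of each GPU, rounded down (assumed reading of the formula).
    n_y_gpu = [math.floor(n_y * p * n / p_total) for p, n in gpus]
    # (28): the remaining nodes stay in RAM for the CPU threads.
    n_y_cpu = n_y - sum(n_y_gpu)
    return p_total, n_y_gpu, n_y_cpu


# Hypothetical system: 8 CPU threads and two identical GPUs whose
# multiprocessors are three times faster than one CPU thread.
p_total, n_y_gpu, n_y_cpu = split_y_between_devices(
    n_y=1600, p_cpu=1.0, n_cpu_threads=8, gpus=[(3.0, 8), (3.0, 8)])
print(p_total, n_y_gpu, n_y_cpu)
```

Because of the rounding down, the CPU share absorbs the remainder, so the shares always add up to $n_y$.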

To calculate the number of nodes along the $y$ coordinate in the blocks of the computational grid processed by GPU
streaming multiprocessors, we use the formulas

$$n^b_{yGT} = \left\lfloor \frac{n_y^b}{N_{S^2_b}} \right\rfloor, \qquad n^b_{yGTL} = n_y^b - \sum_{m=1}^{N_{S^2_b} - 1} n^b_{yGT}, \tag{29}$$

where $n^b_{yGT}$ is the number of nodes along the $y$ coordinate in the computational grid blocks processed by the
streaming multiprocessors of the GPU with index $b$, except for the last block; $n^b_{yGTL}$ is the number of nodes
along the $y$ coordinate in the last block of the computational grid processed by the streaming multiprocessors of the
GPU with index $b$.

To calculate the number of nodes of the computational grid along the $y$ coordinate in the blocks processed by the
program threads implementing the calculation process on the CPU, we use the formulas

$$n_{yCT} = \left\lfloor \frac{n_y^{CPU}}{N_{S^1}} \right\rfloor, \qquad n_{yCTL} = n_y^{CPU} - n_{yCT} (N_{S^1} - 1), \tag{30}$$

where $n_{yCT}$ is the number of nodes of the computational grid along the $y$ coordinate processed by each CPU program
thread except the last one; $n_{yCTL}$ is the number of nodes of the computational grid along the $y$ coordinate
processed by the last CPU program thread.


The number of computational grid fragments along the $y$ coordinate is

$$N_y^f = N_{S^1} + \sum_{b=1}^{N_{GPU}} N_{S^2_b}. \tag{31}$$

Let the numbers of fragments $N_x^f$ and $N_z^f$ along the $x$ and $z$ coordinates, respectively, be specified. Then
the number of nodes of the computational grid along the $x$ coordinate is calculated using the formulas

$$n_x^f = \left\lfloor \frac{n_x}{N_x^f} \right\rfloor, \qquad n_{xL}^f = n_x - n_x^f (N_x^f - 1), \tag{32}$$

where $n_x^f$ is the number of nodes of the computational grid along the $x$ coordinate in all fragments except the
last one; $n_{xL}^f$ is the number of nodes of the computational grid along the $x$ coordinate in the last fragment.

Similarly, the number of nodes of the computational grid along the $z$ coordinate is calculated:

$$n_z^f = \left\lfloor \frac{n_z}{N_z^f} \right\rfloor, \qquad n_{zL}^f = n_z - n_z^f (N_z^f - 1), \tag{33}$$

where $n_z^f$ is the number of nodes of the computational grid along the $z$ coordinate in all fragments except the
last one; $n_{zL}^f$ is the number of nodes of the computational grid along the $z$ coordinate in the last fragment.
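The splitting rule shared by (29), (30), (32) and (33) can be sketched as a single helper: all parts but the last get the same number of nodes, and the last part takes whatever remains. This is a sketch that assumes rounding down in the first formula of each pair; the function name `fragment_sizes` is hypothetical. The grid dimensions in the example are those of the basic stage of the experiment.

```python
def fragment_sizes(n_nodes, n_parts):
    """Return (nodes in every part but the last, nodes in the last part)."""
    base = n_nodes // n_parts                 # assumed rounding down
    last = n_nodes - base * (n_parts - 1)     # the last part takes the rest
    return base, last


# Splits along z for a grid with 200 nodes along z.
for n_parts in (1, 2, 4, 8, 20):
    print(n_parts, fragment_sizes(200, n_parts))

# Split along x: 1,600 nodes into 32 fragments.
print(fragment_sizes(1600, 32))
```

When the node count is not divisible by the number of parts, the last part is larger than the others; for example, 103 nodes in 4 parts give three parts of 25 nodes and a last part of 28.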

Let us describe a model of the parallel-pipeline method. Suppose it is necessary to organize a parallel process of
computing some function $F$ on $M$, where the calculations in each fragment $G^{k_1,k_2,k_3}$ depend on the values in
the neighboring fragments whose index along one of the coordinates $x$, $y$, $z$ is one less than that of the current
fragment (Fig. 1).

To organize the parallel-pipeline method, we introduce a set of tuples $A$ that specify the correspondences $a$ between
the identifiers $e$ of the program threads processing fragments $G^{k_1,k_2,k_3}$ and the step numbers $r$ of the
parallel-pipeline method:

$$\forall e \in E\ \ \exists a \in A:\ a = (e, G^{k_1,k_2,k_3}, r), \tag{34}$$

where $r = 1, \dots, N_r$ is the step number of the parallel-pipeline method; $N_r$ is the number of steps of the
parallel-pipeline method, calculated from the formula

$$N_r = N_x^f \cdot N_z^f + N_y^f - 1. \tag{35}$$

All computing devices in the proposed parallel-pipeline method are fully loaded from step $r_{start} = N_y^f$ to step
$r_{stop} = N_x^f \cdot N_z^f$. The total number of steps with a full load of the computing devices $N_{full}$ is
therefore

$$N_{full} = r_{stop} - r_{start} + 1 = N_x^f \cdot N_z^f - N_y^f + 1. \tag{36}$$

The calculation time of some function $F$ by the parallel-pipeline method is written as

$$T = \sum_{r=1}^{N_r} \max(T_r), \tag{37}$$

where $T_r$ is the vector of times spent on processing fragments in parallel mode at step $r$.
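The step counts (35) and (36) and the time model (37) can be checked with a small Python sketch of the schedule. It assumes one program thread per fragment row along $y$ and that each thread walks its fragments along $x$ first and then along $z$; this traversal order and the function names are assumptions made for illustration, chosen so that every fragment is processed after the neighbors it depends on.

```python
def pipeline_schedule(n_fx, n_fy, n_fz):
    """Map each step r (1-based) to a list of (thread, fragment) pairs.

    Thread j processes the fragments (kx, j, kz) of its own row and
    starts one step after thread j - 1.
    """
    n_steps = n_fx * n_fz + n_fy - 1                 # (35)
    schedule = {r: [] for r in range(1, n_steps + 1)}
    for j in range(n_fy):
        for q in range(n_fx * n_fz):                 # q-th fragment of thread j
            kx, kz = q % n_fx, q // n_fx             # assumed order: x, then z
            schedule[j + q + 1].append((j, (kx, j, kz)))
    return schedule


def total_time(step_times):
    """(37): sum over steps of the slowest fragment time within the step."""
    return sum(max(times) for times in step_times)


schedule = pipeline_schedule(n_fx=4, n_fy=3, n_fz=2)
full_load = [r for r, work in schedule.items() if len(work) == 3]
print(len(schedule), full_load[0], full_load[-1])
```

With 4 x 3 x 2 fragments the sketch gives 10 steps, of which steps 3 to 8 keep all three threads busy, in agreement with (35) and (36).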


[Figure: fragments labelled by their indices along x, y and z, one row per program thread $e_0$, $e_1$, $e_2$, $e_3$,
...; in the layer shown, fragment $(i, j, 0)$ is processed at step $r = i + j$, with steps counted from 0.]

Fig. 1. Parallel-pipelined computing process


Research Results. The computational experiments were carried out on the K-60 high-performance computing system of
the Keldysh Institute of Applied Mathematics, RAS. A GPU section was used in which each node was equipped with two
Intel Xeon Gold 6142 v4 processors, four Nvidia Volta GV100GL video adapters and 768 GB of RAM.

The computational experiment consisted of two stages, preparatory and basic. At the preparatory stage, the
correctness of the decomposition of the computational domain into subsections, blocks and fragments was checked by
step-by-step comparison of the values at the nodes of the initial grid with those in the fragments obtained by
decomposition. Then the thread control algorithm was checked. For this, the time spent processing 1, 8, 16 and 32
fragments of the computational grid, each measuring 50 nodes along the spatial coordinates x, y and z, by the
alternating-triangular iterative method in parallel mode was measured, with the number of CPU threads $N_{S^1}$ equal
to the number of fragments. Each measurement was repeated ten times, and the arithmetic mean $T_r$ and the standard
deviation $\sigma$ were calculated. From these data, the time per fragment of the computational grid
$T^1 = T_r / N_{S^1}$ and the acceleration $E = T^1(1) / T^1(N_{S^1})$, equal to the ratio of the per-fragment
processing time with one thread $T^1(1)$ to the per-fragment processing time with $N_{S^1}$ threads $T^1(N_{S^1})$,
were calculated. The experimental data are given in Table 1.

The experiment showed that the standard deviation was smallest, 0.026 ms, with 32 parallel CPU threads. That is,
processing 32 fragments of the computational grid with 32 parallel CPU threads loaded the program threads most evenly
in time, which increased the overall efficiency of the computing node. The average time to process one fragment in this
case was 4.14 ms. The dependence of the acceleration $E$ on the number of threads turned out to be linear,
$E = 0.603 + 0.804 N_{S^1}$, with a coefficient of determination of 0.99. Thus, the acceleration of the developed
algorithm grows with the number of threads, which indicates efficient use of the memory subsystem.


Table 1
Results of the preparatory stage of the computational experiment

$N_{S^1}$   $\max(T_r)$, ms   $\sigma$, ms   $T^1$, ms   $E$
1           3.38              0.141          3.38        1.00
8           3.66              0.042          0.46        7.39
16          3.94              0.028          0.25        13.73
32          4.14              0.026          0.13        26.13
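The derived columns of Table 1 and the linear regression quoted in the text can be recomputed from the measured step times. The sketch below uses only the values printed in Table 1; the helper `linear_fit` is an ordinary least-squares fit written out by hand and is not taken from the described program.

```python
n_threads = [1, 8, 16, 32]
max_t_ms = [3.38, 3.66, 3.94, 4.14]      # column max(T_r) of Table 1

# Time per fragment T^1 = T_r / N and acceleration E = T^1(1) / T^1(N).
t1_ms = [t / n for t, n in zip(max_t_ms, n_threads)]
speedup = [t1_ms[0] / t for t in t1_ms]


def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = linear_fit(n_threads, speedup)
print([round(e, 2) for e in speedup], round(slope, 3), round(intercept, 3))
```

The recomputed accelerations match the $E$ column of Table 1, and the fitted coefficients agree with the regression $E = 0.603 + 0.804 N_{S^1}$ to three decimal places.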


At the basic stage of the computational experiment, a three-dimensional computational domain of 1,600 x 1,600 x 200
nodes along the spatial coordinates x, y and z, respectively, was divided into 32 fragments of 50 nodes along each of
the coordinates x and y. The division into fragments along the z coordinate is given in Table 2. For each
decomposition option, the time to process the entire computational grid by the proposed parallel-pipeline method was
measured ten times, and its average value $T_{pp}$ was calculated. The acceleration $E_{pp}$ was calculated as the
ratio of the time $T_{seq}$ of the sequential version of the algorithm, equal to 6,963 ms, to $T_{pp}$. The
regression equation $E_{pp} = 7.35 + 1.97 \ln(N_z^f)$ with a coefficient of determination of 0.94 was obtained.
Analysis of the results of the basic stage showed a significant slowdown in the growth of $E_{pp}$ at $N_z^f > 10$.
Therefore, we conclude that splitting the grid into no more than 10 fragments along the z coordinate is optimal.


Table 2
Results of the main stage of the computational experiment

$N_z^f$   $n_z^f$   $T_{pp}$, ms   $E_{pp}$
1         200       1033.20        6.74
2         100       779.00         8.94
4         50        651.90         10.68
8         25        588.35         11.84
20        10        550.22         12.66
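The acceleration column of Table 2 and the logarithmic regression can be recomputed in the same way. The sketch assumes a sequential time of 6,963 ms, the value for which the ratio of sequential to parallel time reproduces the tabulated accelerations to within rounding; the hand-written least-squares helper is illustrative only.

```python
import math

n_frag_z = [1, 2, 4, 8, 20]
t_pp_ms = [1033.20, 779.00, 651.90, 588.35, 550.22]   # Table 2
t_seq_ms = 6963.0        # assumed sequential time in ms

# Acceleration: sequential time divided by parallel-pipeline time.
speedup = [t_seq_ms / t for t in t_pp_ms]


def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Fit E = intercept + slope * ln(N_z^f).
slope, intercept = linear_fit([math.log(n) for n in n_frag_z], speedup)
print([round(e, 2) for e in speedup], round(slope, 2), round(intercept, 2))
```

Going from 8 to 20 fragments along z raises the acceleration by less than one unit, which is the slowdown in growth that motivates the limit of about 10 fragments.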


Discussion and Conclusion. As a result of the research, a model of a parallel-pipeline computing process was
developed using the example of one of the most time-consuming stages of solving a system of grid equations by a
modified alternating-triangular iterative method. The model is built on decomposition models of a three-dimensional
uniform computational grid that take into account the technical characteristics of the equipment used in the
calculations.

The results of the computational experiments confirmed the effectiveness of the developed method. The correctness of
the decomposition of the computational domain into subsections, blocks and fragments was also confirmed, and the
operation of the thread control algorithm was verified. The standard deviation was smallest, 0.026 ms, with 32
parallel CPU threads; that is, processing 32 fragments of the computational grid with 32 parallel CPU threads loaded
the program threads most evenly in time. The average time to process one fragment in this case was 4.14 ms.

Processing of the calculation times measured for the proposed parallel-pipeline method showed a significant slowdown
in the growth of acceleration when the grid was divided into more than 10 fragments along the z coordinate
($N_z^f > 10$). Splitting into no more than 10 fragments along the z coordinate was found to be optimal.